To understand the formation, evolution, and function of complex systems, it is crucial to understand
the internal organization of their interaction networks. Partly due to the impossibility
of visualizing large complex networks, resolving network structure remains a challenging problem. 
Here we overcome this difficulty by combining the visual pattern recognition ability of humans with 
the high processing speed of computers to develop an exploratory method for discovering groups 
of nodes characterized by common network properties, including but not limited to communities 
of densely connected nodes. Without any prior information about the nature of the groups, the 
method simultaneously identifies the number of groups, the group assignment, and the properties 
that define these groups. The results of applying our method to real networks suggest the possibility 
that most group structures lurk undiscovered in the fast-growing inventory of social, biological, and 
technological networks of scientific interest. 



The highly structured internal organization of complex 
networks can both impact and reflect their dynamics and 
function. Previous work on identifying and studying
this organization has focused mainly on network
communities, which are subsets of nodes defined by the
difference between their internal and external link den- 
sity. To provide a fresh perspective on this problem, we 
seek to capture more general structures characterized by 
other network properties. For this purpose, we introduce
the notion of structural groups, defined as subsets
of nodes sharing common structural properties that set 
them apart from other nodes in the network. Using a 
given set of $p \gg 1$ node properties (such as centrality and
spectral properties) as the coordinates for each node in
the p-dimensional space $\mathbb{R}^p$, we identify structural groups
as clusters of points in this node property space. Figure 1
shows an illustrative example of a network for which no
standard network visualization shows clear group structure
(Fig. 1a). However, an appropriate two-dimensional
projection in the node property space reveals a hidden,
but unambiguous three-group structure (Fig. 1b), which
can be used to generate a far more informative layout of
the network (Fig. 1e). Application of existing community
detection methods is not expected to resolve these
groups, since they are not distinguishable by link density
alone (Fig. 1c). Neither is the direct application of existing
clustering methods in the full node property space
or in the projection onto any lower-dimensional space,
due to the known fact that groups with widely different
scatter sizes may not be correctly grouped by unsupervised
algorithms (Fig. 1d and Supplementary Fig. S1).
Distinguishing structural groups may in general require
a combination of two or more properties; Fig. 1b shows
that the degree and the average degree of neighbors suffice
for this example. It is difficult, however, to identify
such a combination without knowing the groups a priori.

Our approach overcomes these difficulties using the vi- 
sual processing ability of a human user as an integral part 
of the analysis. The approach is based on visual analytics,
which is conceptualized as exploratory statistics
in which analytical reasoning is facilitated by a visual
interactive interface. Humans generally outperform automated
computer algorithms in visual recognition tasks, such as
labeling images and deciphering distorted texts, which
forms the basis of spam prevention systems and
crowdsourcing for the digitization of old books. We exploit
this capability by asking the user to inspect a selection
of two-dimensional projections of the node property
space for possible separation of nodes into groups. Since 
any projection could potentially reveal good separation 
of groups, we first consider the result of choosing these 
projections randomly. For two clusters of points with a 
gap between them in high dimension, the probability can 
be very small for the clusters to be separable by a straight 
line in a random two-dimensional projection. This prob- 
ability depends strongly on the "effective dimension" of 
the clusters. For example, if two Gaussian-distributed 
clusters of 100 points have their centers 6 units apart 
in the 28-dimensional space, the probability is less than 
0.001 if the variance of the clusters in every direction 
is one, but increases to about 0.017 if the variance is 
reduced by a factor of 10 in all but 10 orthogonal direc- 
tions. We find that the effective dimension is relatively 
small for the groups discovered in the networks consid- 
ered here, most of them with dimension less than 12 (out 
of 28) when defined as the minimum number of principal 
components required to account for 90% of the variance 
within the group. To further enhance the probability of 
separating groups, we sample random projections with a 
systematic bias (see Methods). This increases the separation
probability for the example of Gaussian clusters
above to around 0.68 for a single projection. If the user
visually recognizes separation of nodes into groups in a
two-dimensional projection, the group assignment is entered
through a graphical interactive interface (Fig. 2a-d,
Supplementary Video S1). The integration of the visual
component allows the user not only to supervise the
process, but also to learn and create intuition from taking
part in the process, thus facilitating the search for
unanticipated network structures. It also accommodates
naturally an ultimate goal of clustering algorithms, which
is to reproduce how a human would group a given set of
points.

Figure 1: Discovering hidden group structure beyond
density-based communities. a, Visualization of a network
by the Gürsoy-Atun algorithm, which attempts to place
nodes uniformly while keeping the network neighbors close.
This and other standard layout algorithms fail to disentangle
the network and reveal any clear group structure. b, Using
our visual analytics method, a user can discover three
structural groups (of sizes 150, 50, and 30) without a priori
information about the number of groups. The groups can be
characterized by the degree and the neighbors' average degree,
and at least two properties are necessary to resolve the
entire group structure. c, Even the most general community
detection method does not divide the network correctly. d,
The K-means algorithm, one of the most frequently used
methods for general clustering problems, does not correctly
capture the group structure when applied directly to the full
node property space, even if the number of groups K = 3 is
given. e, Layout of the network using the discovered groups.
For clarity, both panels a and e show only 10% of the links.
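The unbiased random projection step discussed above can be illustrated with a short sketch. It is not the implementation used for the results in this article: the function names (`random_plane`, `project`, `gaussian_cluster`) are hypothetical, the plane is drawn uniformly at random rather than with the systematic bias described in Methods, and the example data simply mimic the two Gaussian clusters of 100 points whose centers are 6 units apart in 28 dimensions.

```python
import math
import random


def random_plane(p, rng):
    """Return two orthonormal p-dimensional vectors spanning a random plane."""
    u = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [x / norm_u for x in u]
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    dot_uv = sum(a * b for a, b in zip(u, v))
    v = [b - dot_uv * a for a, b in zip(u, v)]  # Gram-Schmidt step
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    return u, v


def project(points, plane):
    """Project p-dimensional points onto the plane spanned by (u, v)."""
    u, v = plane
    return [
        (sum(a * b for a, b in zip(x, u)), sum(a * b for a, b in zip(x, v)))
        for x in points
    ]


def gaussian_cluster(center, n_points, std, rng):
    """Draw n_points around center with independent Gaussian coordinates."""
    return [[c + rng.gauss(0.0, std) for c in center] for _ in range(n_points)]


rng = random.Random(0)
p = 28
cluster_a = gaussian_cluster([0.0] * p, 100, 1.0, rng)
cluster_b = gaussian_cluster([6.0] + [0.0] * (p - 1), 100, 1.0, rng)
xy = project(cluster_a + cluster_b, random_plane(p, rng))  # 200 points in 2D
```

Plotting `xy` for several seeds gives a feel for how rarely a single unbiased projection separates the two clusters when their spread is the same in every direction.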

The chance of capturing a group structure is even fur- 
ther enhanced by the multiplicative effect of using more 
than one projection. Indeed, the separation probability 
in the example above rises from 0.68 to above 0.999 with
just 7 projections. In general, for a given number L of
random projections, the probability that all of these projections
fail to separate a given pair of groups decreases
to zero exponentially with L. After the user processes a
given number L of projections, each node i in the network
will be associated with a group assignment vector $a^{(i)}$
representing the user input (Fig. 2d). Since we typically
have a large number of distinct assignment vectors, we
aggregate the corresponding nodes into a smaller, more
meaningful number of structural groups by single-linkage
hierarchical clustering. For this, we use the Hamming
distance between the group assignment vectors of different
nodes, $a^{(i)}$ and $a^{(j)}$, which in this case is the number
of projections for which the user has placed those nodes
in different groups. This results in a dendrogram that we
can cut at a threshold distance d to obtain a grouping,
in which being in different groups indicates that the user
has placed these nodes in different groups in at least d
out of L projections (Fig. 2e; Supplementary Video S1).
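The aggregation step just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the names `hamming` and `cut_single_linkage` are hypothetical, a rejected projection is assumed to be encoded by giving all nodes the same index, and the single-linkage dendrogram is cut by merging every pair of nodes whose Hamming distance is below d (which is equivalent to cutting the dendrogram at d, since single linkage merges clusters at their minimum pairwise distance).

```python
def hamming(a, b):
    """Number of projections in which two nodes received different indices."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cut_single_linkage(assignments, d):
    """Cut the single-linkage dendrogram at Hamming distance d.

    Nodes end up in the same group when they are linked by a chain of
    pairs with Hamming distance below d, so nodes in different groups
    differ in at least d projections.
    """
    n = len(assignments)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if hamming(assignments[i], assignments[j]) < d:
                parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Five nodes, L = 4 projections (hypothetical user input).
a = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0], [1, 1, 0, 1]]
print(cut_single_linkage(a, 2))
```

Lowering d splits the groups further and raising it merges them, which is what produces the hierarchy of groupings compared in the text.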
To compare the different groupings obtained at different 
thresholds, we define the quality of grouping Qg by 

$$Q_g := \frac{1}{q_g} \cdot \frac{\langle \lVert c_k - c_l \rVert \rangle_{k,l}}{\langle \lVert x^{(i)} - c_{k_i} \rVert \rangle_i}, \qquad (1)$$

where vector $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_p)$ represents node
i in the property space, vector $c_k$ is the center of group k,
index $k_i$ denotes the group to which node i belongs, and
$\lVert \cdot \rVert$ defines the p-dimensional Euclidean distance. The
ratio of the two bracketed quantities in Eq. (1) measures
the average separation distance between groups (the average
over all pairs of groups, denoted $\langle \cdot \rangle_{k,l}$) relative to
the spread within individual groups (the average over all
nodes, denoted $\langle \cdot \rangle_i$). This quantity is then normalized
by a constant $q_g$, chosen to remove a systematic dependence
of the quality of grouping on the number of groups
K (see Methods). As one lowers the threshold level, the 
quality of grouping Qg tends to drop sharply at a cer- 
tain level (Fig. 2f). To obtain the maximum number of 
high-quality groups, we suggest choosing the group as- 
signment, as well as the number of groups K, at the 
threshold level just above the largest drop in Qg, which 
we call the Qg drop-off.
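To make the quantities in this paragraph concrete, the following sketch computes the unnormalized ratio (average distance between group centers divided by average distance of nodes to their own group center) and picks the threshold just above the largest drop. The function names are hypothetical, the normalization by qg is deliberately left out, and the input is assumed to contain at least two groups.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def separation_ratio(points, labels):
    """Between-group separation relative to within-group spread."""
    groups = {}
    for x, k in zip(points, labels):
        groups.setdefault(k, []).append(x)
    centers = {k: centroid(members) for k, members in groups.items()}
    pair_dists = [
        math.dist(centers[k], centers[l]) for k, l in combinations(centers, 2)
    ]
    between = sum(pair_dists) / len(pair_dists)
    within = sum(
        math.dist(x, centers[k]) for x, k in zip(points, labels)
    ) / len(points)
    return between / within


def drop_off_level(levels, quality):
    """Return the threshold just above the largest drop in quality.

    levels are thresholds in ascending order and quality[i] is the
    quality of grouping obtained at levels[i].
    """
    drops = [quality[i + 1] - quality[i] for i in range(len(levels) - 1)]
    i_max = max(range(len(drops)), key=drops.__getitem__)
    return levels[i_max + 1]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(separation_ratio(points, [0, 0, 1, 1]))
```

In this toy example the two group centers are 10 units apart and every node lies 1 unit from its own center, so the ratio is 10.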



Results 

We implemented our visual analytics method using the 
selection of p = 28 node properties listed in Table I, which 
encompasses important node attributes that capture lo- 
cal information, such as degree and clustering, and others 
that capture more global information, such as betweenness
centrality and Laplacian eigenvectors. In particular,
the eigenvectors of the Laplacian and of the normalized
Laplacian allow the detection of communities
and bipartite or multipartite structures, respectively, as
well as mixtures of these structures, assuring our method 
the ability to detect group structures defined by link den- 
sity as special cases. Using this set of properties for the 
Figure 2: Our visual analytics method. a, The p node properties $x^{(i)}_1, \ldots, x^{(i)}_p$ are computed for each node i in a given
network of n nodes. The nodes are then represented as points in $\mathbb{R}^p$, which are projected onto a randomly chosen
two-dimensional subspace. b, c, Using a graphical interface, the user can either reject the projection (b), which indicates that
there is no visible group separation, or indicate visible groups (c), which automatically assigns a group index to each node
for that particular projection. d, Repeating this for a given number of random projections, each node i is associated with a
group assignment vector $a^{(i)}$ listing the group indices the user has assigned to node i. We used 30 projections for all results in
this article. e, Dendrogram obtained by clustering the vectors $a^{(1)}, \ldots, a^{(n)}$. Cutting the dendrogram at a threshold Hamming
distance d produces a grouping for the network. f, Quality of grouping Qg as a function of the threshold level d. The appropriate
number of groups is determined to be K = 3 by thresholding at the Qg drop-off (dashed line).



example network of Fig. 1, we obtain the dendrogram 
shown in Fig. 2e. The number of groups for this network 
is found to be K = 3 at the Qg drop-off (Fig. 2f), which
agrees with the group separation visible in the projection
shown in Fig. 1b. This accurately reflects the fact that
the network was synthetically constructed from three dis- 
tinct structural groups: the first two groups character- 
ized by high (> 65) and low (< 55) prescribed degrees, 
respectively, but connected randomly otherwise, and the 
third group characterized by higher connection proba- 
bility with internal nodes (0.3) than with external ones 
(0.1). This example illustrates that our method is ca- 
pable of discovering not only group structures defined by 
link density, but also more general group structures, even 
when different types of structures coexist in the same net- 
work. Moreover, as shown in Fig. 3 for two-group bench- 
mark networks, the visual analytics method is generally 
expected to outperform existing methods if the groups 
have different internal structures, in this case determined 
by their different degree distributions (see Methods). 

Figure 4 shows a visualization of the hierarchy of 
nested structural groups identified by applying our 
method to a selection of six real-world networks spanning
different sizes and domains (Table II). To further char- 
acterize these groups, we rank the node properties based 
on a two-dimensional projection in which the discovered 
groups reveal maximal separation (see Methods). We 
then discard the low-ranking properties that have negli- 
gible effect on the group separation, keeping only those 
indicated under each panel. Surprisingly, while most 
groups cannot be identified using a single node property, 
the node structural groups are completely separated in 



Table I: Node properties used to generate our results. 

j        $x^{(i)}_j$ (jth property of node i)

1, 2     The degree^a of node i and the average degree of the
         neighbors of node i
3        The clustering coefficient^b of node i
4        The average shortest path length from node i to all
         the other nodes
5, 6     The betweenness centrality^c of node i and the average
         of the same quantity over the neighbors of node i
7, 8     The subgraph centrality^d of node i and the average
         of the same quantity over the neighbors of node i
9-18     The ith component of the eigenvector associated with
         the 2nd (smallest nonzero) through the 11th eigenvalue
         of the Laplacian matrix^e
19-28    The ith components associated with the 10 largest
         eigenvalues of the normalized Laplacian matrix^f

Here we consider only undirected and unweighted networks for
simplicity. Each quantity is normalized to the unit interval [0, 1]
before applying our analysis, which in this case reduces the node
property space to the 28-dimensional unit hypercube.

^a The number of links attached to node i.
^b The fraction of pairs of neighbors of node i that are connected.
^c The number of shortest paths passing through node i.
^d The weighted sum of the number of closed paths in which node i
   participates.
^e The Laplacian matrix L is defined by $L_{ij} = -1$ if nodes i and
   $j \neq i$ are connected, $L_{ij} = 0$ if they are not connected, and
   $L_{ij}$ equals the degree of node i if i = j.
^f The normalized Laplacian matrix is obtained by dividing each
   $L_{ij}$ by the degree of node i.
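A few of the local properties in Table I, together with the normalization to the unit interval, can be sketched as follows. This is only an illustration: the adjacency-dictionary representation and the function names are assumptions, the network is taken to be undirected and unweighted as in the table, and the more global properties (shortest paths, centralities, Laplacian eigenvectors) are omitted.

```python
def degree(adj, i):
    """Number of links attached to node i."""
    return len(adj[i])


def avg_neighbor_degree(adj, i):
    """Average degree of the neighbors of node i (0.0 for isolated nodes)."""
    if not adj[i]:
        return 0.0
    return sum(len(adj[j]) for j in adj[i]) / len(adj[i])


def clustering(adj, i):
    """Fraction of pairs of neighbors of node i that are connected."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if nbrs[b] in adj[nbrs[a]]
    )
    return 2.0 * links / (k * (k - 1))


def normalize(values):
    """Rescale a list of values linearly to the unit interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


# Toy network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
degrees = [degree(adj, i) for i in sorted(adj)]
print(normalize(degrees))
```

Stacking the normalized columns for all chosen properties gives one point per node in the unit hypercube.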
Figure 3: Performance comparison for detecting
density-based communities. Using benchmark networks
consisting of two groups, we compare the performance of the
visual analytics method against alternative methods, measured
by the adjusted Rand index R between the computed
and the true groupings (see Methods for the details of our
benchmarking procedure). Our method (red filled circles)
finds the correct group assignment almost perfectly for
inter-group connection probability pout ≲ 0.15, and performs
reasonably well for larger values of pout. The mixture model
method (blue open circles) performs well for pout ≲ 0.10.
The K-means algorithm (green open squares) shows
consistently low performance. The use of nonlinear kernels,
dimensionality reduction based on principal component
analysis, and alternative schemes for assigning weights for node
properties (see Methods) led to improved performance only
for small pout values. One of the largest such improvements is
shown here (purple open triangles). In contrast, replacing the
human user with the K-means algorithm to process the
two-dimensional projections in our method (see Methods) shows
significantly better performance (orange open diamonds) than
the direct application of the K-means variants to the node
property space (green open squares and purple open
triangles), although still worse than the visual analytics method
(red filled circles). This demonstrates both the effectiveness of
the multiple random projection approach and the advantage
of the human interactive component over unsupervised
algorithms. Each point in the plot is the average of R computed
after removing two outliers (smallest and largest R) from a
total of 20 network realizations. The visual analytics method
is generally expected to outperform existing methods if the
groups have different internal structures, in this case
determined by their different degree distributions (see Methods).



this plane for four of the networks. The groups in three of 
the networks, the polbooks, netscience, and disease networks
(Fig. 4d-f), are separated using two eigenvectors of
the Laplacian matrix, suggesting that these groups could
be similar to density-based communities detected by existing
methods; when quantified by the Rand index,
however, the similarity appears relatively low (Supple- 
mentary Fig. S2). The groups in a fourth network, the 
karate network (Fig. 4a), can also be separated in a plane, 
but this projection requires the use of 15 properties led 



Table II: Networks analyzed. 



Dataset        n      m    K    dH    pd

karate        34     78    3     3    15
polbooks     105    441    2     2     2
adjnoun      112    425    4     2     5
football     115    613    7     3     3
netscience   379    914    4     4     2
disease      516   1188    2     5     2

We used the following datasets: karate, the social network of
Zachary's Karate club; polbooks, a network of political books;
adjnoun, a network of English words; football, a network of
collegiate American football teams; netscience, a network of
network scientists; and disease, a human disease network (see
Supplementary Table S1 for a more detailed description of these
networks). Here n is the number of nodes, m is the number of
links, K is the number of structural groups at the Qg drop-off, dH
is the Hamming distance at the Qg drop-off, and pd is the minimum
number of node properties necessary to produce a two-dimensional
projection in which most or all of the K discovered groups are
visually separated.

by the average degree, average betweenness, and average 
subgraph centrality of neighbors (see Table I notes for
the definition). The groups in the other two networks, 
the adjnoun and football networks (Fig. 4b-c), are mostly 
but not completely separated in this two-dimensional 
representation. We emphasize that it is not necessary for 
all the groups to be separable in a single two-dimensional 
projection. In fact, while each such projection may only 
illuminate part of the hidden group structure (such as 
the separation between a single group and all the oth- 
ers), the multiplicative effect of integrating information 
from many random projections is what often reveals the 
full high-dimensional structure. 

Another remarkable feature of this approach is that, 
because we do not know in advance which properties 
define the groups we seek to identify, the visual an- 
alytics method simultaneously provides the answer to 
the question — the number and identity of the structural 
groups — along with the question itself — the properties 
that define these groups. Even when these properties 
are abstract, further analysis can easily reveal the nature 
of the network's internal organization. For example, con- 
sider the karate network, whose nodes are members of a 
karate club and links are interactions between two mem- 
bers in at least one context external to the club activities. 
The three structural groups identified in Fig. 4a corre- 
spond to (1) members who are central to the club and 
interact with many other members; (2) peripheral mem- 
bers interacting only with very few, but central mem- 
bers; and (3) members forming a community connected 
to the rest of the network only through one central mem- 
ber (Supplementary Fig. S3). Incidentally, one of the 
groups we identify consists of nodes that are connected 
to those outside the group but to none within the group. 
This social group structure is markedly different from the 
well-studied eventual split of the club into two clubs.
Figure 4: Hierarchical group structures discovered for the networks in Table II. Each panel shows the network nodes
plotted in a two-dimensional projection of the node property space. For panels a-c, the two coordinate axes, z1 and z2, are
linear combinations of a selection of node properties that capture most of the group separation, with corresponding coefficients 
listed in the table below each plot. Note that in panels b and c two dimensions are not sufficient to cleanly separate the groups, 
even though our method resolves this separation by combining multiple two-dimensional projections. For panels d-f, only two 
properties are necessary to resolve the groups clearly. The plot on the right of each panel shows the quality of grouping Qg as 
a function of the hierarchical level measured by the Hamming distance. The groupings corresponding to the hierarchical levels 
in the blue part of this plot are indicated in the projections by the shades of blue. 



As an additional example, consider the football net- 
work, where nodes are college American football teams 
and links indicate matches played in the 2000 season. 
Although the teams are organized into 12 conferences 
(including Independents), our method identifies 7 struc- 
tural groups (Fig. 4c). As shown in Figs. 5a and 5b, 
groups 1 and 6 are characterized by the combination 
of high degrees, high subgraph centrality, and the same 
characteristics for their neighbors, while these two groups 
are distinct in clustering coefficient and some Laplacian 
eigenvectors. Low degrees and low subgraph centrality, 
as well as the same characteristics for the neighbors, dis- 
tinguish groups 4 and 7 from others, while they differ in 
their clustering coefficient and a few Laplacian eigenvectors.
Group 2 shows characteristics similar to those of group 1
in terms of subgraph centrality, but the mean shortest 
path distance is very high and the betweenness central- 
ity of the neighbors is very low, reflecting the peripheral 
location of these nodes within the network. Many of 
the Laplacian eigenvectors contribute to the separation 
of the groups, which is consistent with the fact that a 
density-based community structure exists in addition to 
other group structures. In particular, groups 3 and 5 
are communities that can only be distinguished by the 
differences in the Laplacian eigenvectors and clustering 



coefficient. Grouping together Big Twelve and Mountain 
West as well as Atlantic Coast and Big East, but splitting
the Independents (Fig. 5c), this group structure captures
a higher-level organization of the conferences which
is determined by the geographic proximity of the teams 
(Fig. 5d). Similar geographical manifestation of network 
communities has recently been observed in the effective 
boundaries defined by human mobility in the US and 
telecommunications in Great Britain.



Discussion 

The structural groups identified by the visual analytics 
method are characterized by common network properties. 
This provides a foundation for the study of the interplay 
between form and function in complex networks, as net- 
work dynamics (and hence function) is believed to be 
strongly influenced by network structure. The possibil- 
ities are extensive with our approach since the user has 
complete freedom to choose the set of p node properties. 
Within the wide range of possible structures expressible 
through these properties, the visual analytics method can 
help discover a specific group structure of interest and 
interpret it using a ranking of the node properties. The 
Figure 5: Characterizing seven structural groups discovered in the football network. a, Average node properties
of the seven groups. Rows correspond to node properties and columns to groups (the colored disks at the top). Using the
orange color-scale on the left, each cell shows the average node property of the group, relative to the network average and
in units of the network standard deviation. b, Node property distribution within each group. The seven groups in this plot
are color-coded as in the disks at the top of panel a. Small dots indicate the individual values for each node in the network,
larger dots connected by lines indicate the group averages, and bars indicate the range of values for each group. All values are
measured relative to the network average and in units of the network standard deviation. c, Layout of the network with the
structural groups indicated by circles, color-coded as in the other panels. The number and color on a node indicate the college
football conference to which the corresponding team belongs, as listed at the bottom of the panel. d, Geographic distribution
of nodes (teams) over the US, color-coded by the structural groups as in panel c. The fact that more than one conference is 
grouped together as groups 3, 4, and 5 can be interpreted in terms of the proximity of the teams' geographic location and its 
impact on the structure of the network. 



approach can be easily adapted to identify network structures
defined by link rather than node characteristics.
Moreover, it can be applied to networks whose nodes have
quantifiable (but not necessarily structural) properties,



such as age, income and level of education in the case 
of social networks, which remain elusive in existing net- 
work representations. Systematic benchmarking using 
synthetic networks shows that our method has advantages
over existing methods in identifying density-based
communities with distinct internal structures (red vs.
blue curve in Fig. 3). Naturally, existing methods such
as the one proposed in Ref. 14 may still be more effective
in resolving specific networks not represented in our
benchmarks. In finding general structural groups beyond
density-based communities, the visual analytics method
outperforms the direct application of standard clustering
algorithms in the full node property space (Fig. 1;
Supplementary Fig. S1; red vs. green/purple curve in Fig. 3).
This suggests that our approach also has potential to be
an alternative for solving general high-dimensional clustering
problems. The replacement of the human component
in the visual analytics method with a simple heuristic
based on K-means yields a fully objective unsupervised
algorithm, which performs much better than various
extensions of K-means directly applied to the full
node property space (orange vs. green/purple curves in
Fig. 3). This highlights the critical role played by the
integrative analysis of clustering outputs from multiple
projections. Although the visual analytics method converted
to an unsupervised algorithm performs better than standard
unsupervised approaches, the original formulation
with the human component is still more effective (red vs.
orange curve in Fig. 3). By combining the pattern recognition
ability of humans with the processing capability of
computers, our visual analytics method can resolve the
internal organization of complex networks better than either
of them alone.



Methods 

Biased random projections. To enhance the prob- 
ability of resolving group separation, we first choose 
each node property j with probability rj (while requir- 
ing a minimum of four properties) and generate a ran- 
dom projection using those selected properties. The 
probability rj is designed to reflect the relative impor- 
tance of property j in separating the groups. We set 
$r_j := [v_j / \max_j(v_j)]^{\alpha}$, where $v_j := \sum_k w_k v_{kj}^2$ and $v_{kj}$
denotes the jth component of the normalized basis vector
for the kth (out of N) one-dimensional projections
generated randomly and uniformly. The weights $w_k$ are
given by $w_k := \max_i (z_{k,i+1} - z_{k,i}) \cdot (i/n) \cdot (1 - i/n)$, where
$z_{k,1} < z_{k,2} < \cdots < z_{k,n}$ denote the ordered points in the
kth projection for all n nodes in the network. The parameter
$\alpha$ can be used to adjust the bias strength and
was taken to be 2 in all computations.
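The bias computation can be sketched as follows, with several assumptions stated up front. The weight of a one-dimensional projection is taken to be its largest gap between consecutive ordered points, scaled by (i/n)(1 - i/n) so that gaps splitting the nodes evenly count more. The importance of property j is assumed here to accumulate the squared jth component of each basis vector, which is one plausible reading of the garbled source formula and should be checked against the published article. Redrawing until at least four properties are selected is also an assumption about how the minimum is enforced, and all function names are hypothetical.

```python
import math
import random


def random_unit_vector(p, rng):
    """Uniformly random direction in p dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gap_weight(z):
    """Largest gap between consecutive ordered points, scaled by balance."""
    z = sorted(z)
    n = len(z)
    return max((z[i] - z[i - 1]) * (i / n) * (1 - i / n) for i in range(1, n))


def selection_probabilities(points, n_proj, alpha, rng):
    """Probability r_j of including property j in a biased projection."""
    p = len(points[0])
    v = [0.0] * p
    for _ in range(n_proj):
        basis = random_unit_vector(p, rng)
        z = [sum(a * b for a, b in zip(x, basis)) for x in points]
        w = gap_weight(z)
        for j in range(p):
            v[j] += w * basis[j] ** 2  # assumed form of the importance v_j
    v_max = max(v)
    return [(vj / v_max) ** alpha for vj in v]


def choose_properties(r, rng, minimum=4):
    """Pick each property j with probability r[j], redrawing until at
    least `minimum` properties are selected (assumed rule)."""
    while True:
        chosen = [j for j, rj in enumerate(r) if rng.random() < rj]
        if len(chosen) >= minimum:
            return chosen
```

A random projection restricted to the chosen properties can then be drawn in the subspace they span.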

Controlling for group-size effect in Qg. Since 
smaller groups naturally tend to have smaller within- 
group variations, the ratio of the averages in Eq. (1) 
increases with the number of groups K, even when the 
groups are not necessarily better separated. To correct 
for this bias, we define Qg by normalizing the ratio by 
its expected value qg for randomized groupings with the 
individual group sizes kept fixed. We estimated qg by
averaging over 100 realizations. 
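One way to estimate this normalization is to shuffle the group labels among the nodes, which keeps every group size fixed, and average the ratio over the shuffled groupings. The sketch below assumes this shuffling scheme and uses hypothetical function names; it is an illustration rather than the procedure actually implemented for this article.

```python
import math
import random
from itertools import combinations


def separation_ratio(points, labels):
    """Average distance between group centers divided by the average
    distance of points to their own group center."""
    groups = {}
    for x, k in zip(points, labels):
        groups.setdefault(k, []).append(x)
    centers = {
        k: [sum(c) / len(g) for c in zip(*g)] for k, g in groups.items()
    }
    pair_dists = [
        math.dist(centers[k], centers[l]) for k, l in combinations(centers, 2)
    ]
    between = sum(pair_dists) / len(pair_dists)
    within = sum(
        math.dist(x, centers[k]) for x, k in zip(points, labels)
    ) / len(points)
    return between / within


def expected_ratio(points, labels, realizations=100, seed=0):
    """Average ratio over random groupings with the same group sizes."""
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0.0
    for _ in range(realizations):
        rng.shuffle(shuffled)  # permuting labels keeps group sizes fixed
        total += separation_ratio(points, shuffled)
    return total / realizations


def quality_of_grouping(points, labels, realizations=100, seed=0):
    """Ratio for the actual grouping, normalized by its random baseline."""
    baseline = expected_ratio(points, labels, realizations, seed)
    return separation_ratio(points, labels) / baseline
```

With this normalization, values well above 1 indicate a grouping that separates the nodes much better than a random grouping with the same sizes.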



Two-group benchmark networks. For the bench- 
marking results shown in Fig. 3, we used networks hav- 
ing two groups, constructed as follows. In the larger 
group (150 nodes), nodes are connected randomly, with 
the degree of each node fixed to a random integer chosen 
uniformly between 10 and 70. In the smaller group (50 
nodes), node pairs are connected randomly with fixed 
probability pin. Across the two groups, node pairs are
connected with probability pout. For a given pout, we
choose pin to match the average degree in the smaller
group with the average internal degree in the larger
group. The probability pout is varied between 0 (two
completely isolated groups) and 40/150 ≈ 0.27 (no internal
links in the smaller group), with pout = 20/150 ≈ 0.13
corresponding to the point at which the average internal 
and external degrees in the smaller group are equal. 
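The matching condition can be checked with a short calculation. The sketch below rests on one reading of the construction, stated here as an assumption: the prescribed degrees in the larger group (uniform between 10 and 70, hence 40 on average) count internal links, and a node in the smaller group has on average 49 pin internal links plus 150 pout external links. Under that reading pin = (40 - 150 pout) / 49, which reproduces both limits quoted in the text. The function and constant names are hypothetical.

```python
N_LARGE = 150  # nodes in the larger group
N_SMALL = 50   # nodes in the smaller group
MEAN_DEGREE_LARGE = (10 + 70) / 2  # mean of the prescribed degrees


def p_in(p_out):
    """Internal connection probability of the smaller group chosen so
    that its average degree equals MEAN_DEGREE_LARGE (assumed reading)."""
    return (MEAN_DEGREE_LARGE - N_LARGE * p_out) / (N_SMALL - 1)


for p_out in (0.0, 20 / 150, 40 / 150):
    internal = (N_SMALL - 1) * p_in(p_out)
    external = N_LARGE * p_out
    print(f"p_out={p_out:.3f}  p_in={p_in(p_out):.3f}  "
          f"internal={internal:.1f}  external={external:.1f}")
```

At pout = 20/150 the average internal and external degrees in the smaller group are both 20, and at pout = 40/150 the internal probability vanishes.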

Benchmarking procedure. We used the two-group 
network described in the subsection above to com- 
pare performance of various methods for identifying the 
groups. For our visual analytics method, we used the 
node properties listed in Table I and generated 30 biased 
random projections. The threshold level for the result- 
ing dendrogram was selected so as to produce two groups. 
In a few cases where a two-group threshold does not ex- 
ist, we selected the threshold that results in the smallest 
possible number of groups above two. For the mixture 
model method, the number of groups was set to K = 2.
For K-means, the algorithm was applied directly to
the node property space with K = 2. For completeness,
we also examined the performance of K-means using all
possible combinations of choices for (i) kernel (linear,
polynomial, Gaussian, or sigmoid); (ii) dimensionality
reduction (projecting the data points in the node property
space onto the 2, 5, 10, 15, or 20 leading principal compo- 
nents, or no reduction); and (iii) normalization (scaling 
each node property to have zero mean and unit variance, 
normalizing each property to the unit interval [0,1], or 
no normalization). Scaling for zero mean and unit variance
is equivalent to weighting each node property equally
when measuring distances in the node property space, 
while normalizing to the unit interval ensures that all the 
node properties are distributed in the same range. For 
the unsupervised variant of our visual analytics method, 
the human user was replaced by the (linear) K-means 
algorithm with K = 1, 2, ..., 10 to analyze each
two-dimensional projection, with an optimal choice of K
determined by the gap statistic, which is defined based
on a characteristic signature in the K-dependence of the
within-group variation. The performance of each method 
was measured by the adjusted Rand index R between the 
computed and the true groupings (see the subsection be- 
low for definition). 

Rand index. This index measures the similarity be- 
tween two ways of grouping a given set of discrete ob- 
jects, possibly into different numbers of groups. For a 
given pair of groupings of network nodes, the adjusted 
Rand index R is defined as the normalized fraction of 
node pairs that are either classified in the same group in 
both groupings or classified in different groups in both 
groupings. The normalization implies that R = 1 for 
identical groupings and R ≈ 0 for a pair of random groupings. 
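
As an illustration, the index can be computed from the contingency counts of the two groupings. The sketch below assumes the standard chance-corrected (Hubert-Arabie) form of the adjusted Rand index, which the text does not spell out; the function name adjusted_rand_index is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two groupings of the same nodes.

    labels_a[i] and labels_b[i] are the group labels of node i in the
    two groupings; the number of groups may differ between them.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must cover the same nodes")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no node pairs to compare
    # Number of node pairs placed together in both groupings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Number of node pairs placed together in each grouping separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Expected agreement for random groupings with the same group sizes.
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # degenerate case, e.g. a single group in both
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

The value depends only on which nodes are grouped together, not on the label names, so relabeling the groups in either grouping leaves R unchanged.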



Ranking node properties. For a given node group- 
ing, we seek a two-dimensional projection that maximizes 
a group separation measure, defined in terms of the 
weighted center distances $n_k \lVert c_k - c \rVert$ and 
similar to that in Eq. (1) but computed for the pro- 
jected points after the groups have been identified. Here 
$n_k$ denotes the number of nodes in group k, and c de- 
notes the center of all the data points. Such a projection 
plane can be efficiently found by a spectral method 
based on the QR decomposition. The node properties 
are then ranked in the order of increasing angle between 
their coordinate axes and the projection plane. 
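
The final ranking step can be sketched as follows, assuming the projection plane has already been found and is given by two spanning vectors in the node property space. The spectral search itself is not shown, and the names orthonormal_plane, rank_properties, and the inputs a and b are hypothetical. The angle between a coordinate axis and the plane is obtained from the length of the projection of the unit axis vector onto the plane.

```python
from math import acos, sqrt


def orthonormal_plane(a, b):
    """Gram-Schmidt step: orthonormal basis (u, v) of the plane spanned by a and b."""
    norm_a = sqrt(sum(x * x for x in a))
    if norm_a == 0:
        raise ValueError("a must be nonzero")
    u = [x / norm_a for x in a]
    overlap = sum(x * y for x, y in zip(u, b))
    w = [y - overlap * x for x, y in zip(u, b)]
    norm_w = sqrt(sum(x * x for x in w))
    if norm_w == 0:
        raise ValueError("a and b must be linearly independent")
    v = [x / norm_w for x in w]
    return u, v


def rank_properties(names, a, b):
    """Return (name, angle) pairs sorted by increasing angle to the plane."""
    u, v = orthonormal_plane(a, b)
    ranking = []
    for name, uj, vj in zip(names, u, v):
        # Projection length of the j-th unit axis vector onto the plane,
        # clipped to 1 to guard against rounding error.
        cos_angle = min(1.0, sqrt(uj * uj + vj * vj))
        ranking.append((name, acos(cos_angle)))
    return sorted(ranking, key=lambda item: item[1])
```

A property whose axis lies in the plane has angle zero and is ranked first, while a property whose axis is orthogonal to the plane has angle pi/2 and is ranked last.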



this article is available at http://purl.oclc.org/net/find_structural_groups 
Supplementary Figure S1: Comparing hierarchical clustering methods. a, b, Dendrograms obtained by applying the 
single-linkage (a) and complete-linkage (b) clustering algorithms to the full 28-dimensional node property space. We use an 
example network with two structural groups defined by their degree distributions, which we construct by randomly connecting 
nodes with prescribed degrees. A group of 150 low-degree nodes (degree between 10 and 50) and a group of 30 high-degree 
nodes (degree between 70 and 90) form a bimodal degree distribution, bridged by 5 nodes with degree between 50 and 70. 
All node degrees are chosen uniformly at random from the corresponding intervals. The nodes (red dots) are plotted in the 
two-dimensional projection using the degree and the average degree of neighbors, both normalized to the unit interval. For these 
two dendrograms, the distance between two nodes is measured by the Euclidean distance in the full node property space. In 
the single-linkage dendrogram (a) the height at which two groups join is the minimum distance between all pairs of nodes from 
the two groups, while in the complete-linkage dendrogram (b) it is the maximum distance between node pairs. c, Dendrogram 
obtained by our visual analytics method. In applying single-linkage clustering to obtain the dendrogram, the measure used for 
distance between nodes i and j is the Hamming distance between the user input vectors $a^{(i)}$ and $a^{(j)}$ (which can be defined in 
this case, but not in the other two cases). Two or more groups joining in this dendrogram at height d implies that any pair of 
nodes taken from two of these groups has been separated visually by the user in at least d different projections. We see that 
thresholding at a fixed level in this dendrogram would clearly produce the correct splitting of nodes into the two structural 
groups, while the dendrograms in the other two panels do not capture the group structure. This comes from the difficulty in 
recognizing groups with a wide distribution of points, a difficulty shared by the K-means algorithm (see Fig. 1d). 
Supplementary Figure S2: Comparing discovered groups with communities found by a traditional method. For 
each network in Table II, we compare the grouping obtained at a fixed hierarchical level in our method with the division into 
the same number of groups that is found by the method of Ref. 2, which is based on link betweenness centrality. The similarity 
of the two groupings is measured by the adjusted Rand index R (defined in Methods). Each panel shows R as a function of 
the hierarchical level measured by the Hamming distance in our method, with the corresponding number of groups indicated 
in the plots. R values significantly lower than unity are observed at most hierarchical levels for each network, confirming 
that the group structures we discovered are indeed different from the traditional community structures. 
Supplementary Figure S3: Characterizing three structural groups discovered in the karate network. a, Average 
node properties of the three groups. Rows correspond to node properties and columns to groups (red, green, and blue disks 
at the top). Using the orange color-scale on the left, each cell shows the average node property of the group, relative to the 
network average and in units of the network standard deviation. b, Node property distribution within each group. The three 
groups in this plot are color-coded as in the disks at the top of panel a. Small dots indicate the individual values for each 
node in the network, larger dots connected by lines indicate the group averages, and bars indicate the range of values for each 
group. All values are measured relative to the network average and in units of the network standard deviation. c, Layout of the 
network with groups color-coded as in the other panels. The dashed curve indicates the eventual split of the karate club into 
two different clubs, as documented in Ref. 40. It is clear that the group structure discovered by our visual analytics approach 
is distinct from the club split. Group 1 (red) is characterized by low degree, clustering coefficient of one, and large values of 
the components of the first normalized Laplacian eigenvector, as well as by being connected to high-degree nodes with high 
betweenness and subgraph centrality (the first column of panel a and red plots in panel b). Group 2 (green) is characterized by 
very low values of the components of the second Laplacian eigenvector, which is indicative of a traditional community structure. 
This is reflected in panel c, where green nodes form a cluster at the top. High values of mean shortest path distance to other 
nodes for nodes in Groups 1 and 2 imply that these nodes sit in peripheral locations within the network, which is confirmed 
in the network layout in panel c. Group 3 (blue) forms the core of the network and includes all the high-degree, high-centrality 
nodes, which is reflected in the high group average in these quantities. This example network illustrates that our method 
identifies groups of nodes with common structural properties that do not even need to be connected within each group. 




Supplementary Video S1: Visual analytics software for discovering structural groups in networks. The network 
shown in Fig. 1 is used to demonstrate how an implementation of the method introduced in this article discovers the three 
groups of nodes characterized by the degree and the average degree of neighbors. The default set of node properties is the same 
as in Table I, and we use 30 random projections with the bias described in Methods to enhance the probability that the user 
sees group separation. Only a selection of these projections is actually shown in the movie in order to keep the presentation 
short. 
Supplementary Table S1: Description of networks analyzed. 

Dataset      Description                                                                          Reference

karate       Social network of a university-based karate club                                     40
             Node: A club member
             Link: Interaction between the two members in at least one context outside
                   the club activity

polbooks     Network of books on American politics bought from Amazon.com
             Node: A book
             Link: Frequent purchase of the two books together by the same buyer

adjnoun      Network of nouns and adjectives appearing in a novel (David Copperfield by           36
             Charles Dickens)
             Node: A noun or adjective
             Link: Appearance of the two words adjacent to each other in the book

football     Network of collegiate American football teams in the US                              2
             Node: A football team
             Link: The fact that one or more games were played between the two teams in
                   the 2000 regular season

netscience   Largest connected component of the network of scientists who have published
             papers on network science
             Node: A scientist
             Link: Coauthorship between the two scientists

disease      Largest connected component of the network of known human genetic disorders          50
             Node: A genetic disorder
             Link: Existence of a common gene whose mutation is associated with both
                   disorders